Abstract

Supervised learning deals with the inference of a distribution over an
output or label space $\mathcal{Y}$ conditioned on points in an observation
space $\mathcal{X}$, given a training dataset $D$ of pairs in
$\mathcal{X} \times \mathcal{Y}$. However, in many applications of interest,
acquiring large amounts of observations is easy, while generating labels is
time-consuming or costly. One way to deal with this problem is active
learning, where points to be labelled are selected with the aim of creating
a model with better performance than that of a model trained on an equal
number of randomly sampled points. In this paper, we instead propose to deal
with the labelling cost directly: the learning goal is defined as the
minimisation of a cost which is a function of the expected model performance
and the total cost of the labels used. This allows the development of
general strategies and specific algorithms for (a) optimal stopping, where
the expected cost dictates whether label acquisition should continue, and
(b) empirical evaluation, where the cost is used as a performance metric for
a given combination of inference, stopping and sampling methods. Though the
main focus of the paper is optimal stopping, we also aim to provide the
background for further developments and discussion in the related field of
active learning.



1 Introduction 

Much of classical machine learning deals with the case where we wish to learn a
target concept in the form of a function $f : \mathcal{X} \to \mathcal{Y}$, when
all we have is a finite set of examples $D = \{(x_i, y_i)\}_{i=1}^{N}$. However,
in many practical settings, it turns out that for each example $i$ in the set
only the observations $x_i$ are available, while the availability of the labels
$y_i$ is restricted in the sense that either (a) they are only observable for a
subset of the examples or (b) further observations may only be acquired at a
cost. In this paper we deal with the second case, where we can actually obtain
labels for any $i \in D$, but doing so incurs a cost. Active learning algorithms
(e.g. [1, 2]) deal indirectly with this by selecting examples which are expected
to increase accuracy the most. However, the basic question of whether new
examples should be queried at all is seldom addressed.

This paper deals with the labelling cost explicitly. We introduce a cost
function that represents the trade-off between final performance (in terms of
generalisation error) and querying costs (in terms of the number of labels
queried). This is used in two ways: firstly, as the basis for creating
cost-dependent stopping rules; secondly, as the basis of a comparison metric
for learning algorithms and associated stopping algorithms.

To expound further, we decide when to stop by estimating the expected 
performance gain from querying additional examples and comparing it with the 
cost of acquiring more labels. One of the main contributions is the develop- 
ment of methods for achieving this in a Bayesian framework. While due to the 
nature of the problem there is potential for misspecification, we nevertheless 
show experimentally that the stopping times we obtain are close to the optimal 
stopping times. 

We also use the trade-off in order to address the lack of a principled method 
for comparing different active learning algorithms under conditions similar to 
real-world usage. For such a comparison a method for choosing stopping times 
independently of the test set is needed. Combining stopping rules with active 
learning algorithms allows us to objectively compare active learning algorithms 
for a range of different labelling costs. 

The paper is organised as follows. Section 1.1 introduces the proposed cost 
function for when labels are costly, while Section 1.2 discusses related work. Sec- 
tion 2 derives a Bayesian stopping method that utilises the proposed cost func- 
tion. Some experimental results illustrating the proposed evaluation methodol- 
ogy and demonstrating the use of the introduced stopping method are presented 
in Section 3. The proposed methods are not flawless, however. For example, the
algorithm-independent stopping rule requires the use of i.i.d. examples, which 
may interfere with its coupling to an active learning algorithm. We conclude 
with a discussion on the applicability, merits and deficiencies of the proposed 
approach to optimal stopping and of principled testing for active learning. 

1.1 Combining Classification Error and Labelling Cost 

There are many applications where raw data is plentiful, but labelling is time 
consuming or expensive. Classic examples are speech and image recognition, 
where it is easy to acquire hours of recordings, but for which transcription and 
labelling are laborious and costly. For this reason, we are interested in querying 
labels from a given dataset such that we find the optimal balance between the 
cost of labelling and the classification error of the hypothesis inferred from the 
labelled examples. This arises naturally from the following cost function. 
Consider some algorithm $F$ which queries labels for data from some unlabelled
dataset $D$, incurring a cost $\gamma \in [0, \infty)$ for each query. If the
algorithm stops after querying labels of examples $d_1, d_2, \ldots, d_t$, with
$d_i \in [1, |D|]$, it will suffer a total cost of $\gamma t$, plus a cost
depending on the generalisation error. Let $f(t)$ be the hypothesis obtained
after having observed $t$ examples and let $E[R \mid f(t)]$ be the
generalisation error of this hypothesis. Then, we define the total cost for
this specific hypothesis as

$$E[C_\gamma \mid f(t)] = E[R \mid f(t)] + \gamma t. \qquad (1)$$

We may use this cost as a way to compare learning and stopping algorithms, 
by calculating the expectation of C 7 conditioned on different algorithm combi- 
nations, rather than on a specific hypothesis. 
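As a minimal sketch of the cost in (1), the following Python snippet evaluates the cost along an error curve and finds the number of labels that minimises it in hindsight. The error values and the function names `total_cost` and `best_stopping_time` are hypothetical and serve only as an illustration; they are not taken from the experiments in this paper.

```python
from typing import Sequence


def total_cost(error: float, t: int, gamma: float) -> float:
    """Cost (1): generalisation error plus gamma for each of the t labels."""
    return error + gamma * t


def best_stopping_time(errors: Sequence[float], gamma: float) -> int:
    """Return the t minimising errors[t] + gamma * t (smallest t on ties)."""
    costs = [total_cost(e, t, gamma) for t, e in enumerate(errors)]
    return min(range(len(costs)), key=costs.__getitem__)


# Hypothetical error curve: errors[t] is the error after t queried labels.
errors = [0.50, 0.30, 0.20, 0.15, 0.13, 0.12]

for gamma in (0.0, 0.03, 0.5):
    t_star = best_stopping_time(errors, gamma)
    print(gamma, t_star, total_cost(errors[t_star], t_star, gamma))
```

With free labels the minimiser is the end of the curve, while a large enough cost per label makes it optimal not to query at all.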

In addition, this cost function can serve as a formal framework for active 
learning. Given a particular dataset D, the optimal subset of examples to be 
used for training will be $D^* = \arg\min_{A \subseteq D} E(R \mid F, A) + \gamma |A|$. The ideal, but
unrealisable, active learner in this framework would just use labels of the subset 
D* for training. 

Thus, these notions of optimality can in principle be used both for deriving 
stopping and sampling algorithms and for comparing them. Suitable metrics of 
expected real-world performance will be discussed in the next section. Stopping
methods will be described in Section 2. 

1.2 Related Work 

In the active learning literature, the notion of an objective function for trading 
off classification error and labelling cost has not yet been adopted. However, a 
number of both qualitative and quantitative metrics were proposed in order to 
compare active learning algorithms. Some of the latter are defined as summary 
statistics over some subset $\mathcal{T}$ of the possible stopping times. This is
problematic as it could easily be the case that there exist $\mathcal{T}_1,
\mathcal{T}_2$ with $\mathcal{T}_1 \subset \mathcal{T}_2$, such that when
comparing algorithms over $\mathcal{T}_1$ we get a different result than when we
are comparing them over a larger set $\mathcal{T}_2$. Thus, such measures are not
easy to interpret since the choice of $\mathcal{T}$ remains essentially
arbitrary. Two examples are
(a) the percentage reduction in error, where the percentage reduction in error 
of one algorithm over another is averaged over the whole learning curve [3, 4] 
and (b) the average number of times one algorithm is significantly better than 
the other during an arbitrary initial number of queries, which was used in [5]. 
Another metric is the data utilisation ratio used in [5, 4, 6], which is the amount 
of data required to reach some specific error rate. Note that the selection of the 
appropriate error rate is essentially arbitrary; in both cases the concept of the 
target error rate is utilised, which is the average test error when almost all the 
training set has been used. 

Our setting is more straightforward, since we can use (1) as the basis for
a performance measure. Note that we are not strictly interested in comparing
hypotheses $f$, but algorithms $F$. In particular, we can calculate the expected
cost given a learning algorithm $F$ and an associated stopping algorithm
$Q_F(\gamma)$, which is used to select the stopping time $T$. It follows that
the expected cost of $F$ when coupled with $Q_F(\gamma)$ is

$$v_e(\gamma, F, Q_F) \equiv E[C_\gamma \mid F, Q_F(\gamma)] = \sum_t \left( E[R \mid f(t)] + \gamma t \right) P[T = t \mid F, Q_F(\gamma)]. \qquad (2)$$

By keeping one of the algorithms fixed, we can vary the other in order to 
obtain objective estimates of their performance difference. In addition, we may 
want to calculate the expected performance of algorithms for a range of values
of $\gamma$, rather than a single value, in a manner similar to what [7] proposed
as an alternative to ROC curves. This will require a stopping method
$Q_F(\gamma)$ which will ideally stop querying at a point that minimises
$E(C_\gamma)$.
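The expectation in (2) can be sketched in a few lines of Python. The snippet below assumes that the distribution of the stopping time $T$ is available as a table of probabilities (in practice it would be estimated from repeated runs); the error curve, the probabilities and the name `expected_cost` are hypothetical.

```python
from typing import Mapping, Sequence


def expected_cost(errors: Sequence[float],
                  stop_probs: Mapping[int, float],
                  gamma: float) -> float:
    """Expected cost (2): sum over t of (error(t) + gamma * t) * P[T = t]."""
    if abs(sum(stop_probs.values()) - 1.0) > 1e-9:
        raise ValueError("stopping-time probabilities must sum to 1")
    return sum((errors[t] + gamma * t) * p for t, p in stop_probs.items())


# Hypothetical error curve and stopping-time distribution.
errors = [0.50, 0.30, 0.20, 0.15, 0.13, 0.12]
stop_probs = {2: 0.5, 3: 0.5}

# Sweeping gamma gives one point of a cost curve per value.
for gamma in (0.0, 0.01, 0.1):
    print(gamma, expected_cost(errors, stop_probs, gamma))
```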

The stopping problem is not usually mentioned in the active learning
literature and there are only a few cases where it is explicitly considered.
One such case is [2], where it is suggested to stop querying when no example
lies within the SVM margin. The method is used indirectly in [8], where if this
event occurs the algorithm tests the current hypothesis$^1$, queries labels for
a new set of unlabelled examples$^2$ and finally stops if the error measured
there is below a given threshold; similarly, [9] introduced a bounds-based
stopping criterion that relies on an allowed error rate. These are reasonable
methods, but there exists no formal way of incorporating the cost function
considered here within them. For our purpose we need to calculate the expected
reduction in classification error when querying new examples and compare it
with the labelling cost. This fits nicely within the statistical framework of
optimal stopping problems.

2 Stopping Algorithms 

An optimal stopping problem under uncertainty is generally formulated as
follows. At each point in time $t$, the experimenter needs to make a decision
$a \in A$, for which there is a loss function $\mathcal{L}(a, w)$ defined for
all $w \in \Omega$, where $\Omega$ is the set of all possible universes. The
experimenter's uncertainty about which $w \in \Omega$ is true is expressed via
the distribution $P(w \mid \xi_t)$, where $\xi_t$ represents his belief at time
$t$. The Bayes risk of taking an action at time $t$ can then be written as
$\rho_0(\xi_t) = \min_a \sum_w \mathcal{L}(a, w) P(w \mid \xi_t)$. Now, consider
that instead of making an immediate decision, he has the opportunity to take $k$
more observations $D_k$ from a sample space $S^k$, at a cost of $\gamma$ per
observation, thus allowing him to update his belief to
$P(w \mid \xi_{t+k}) = P(w \mid D_k, \xi_t)$. To choose between immediately
making a decision $a$ and continuing sampling, the experimenter must compare the
risk of making a decision now with the cost of making $k$ observations plus the
risk of making a decision after $k$ timesteps, when the extra data would enable
a more informed choice. In other words, one should stop and make an immediate
decision if the following holds for all $k$:

$$\rho_0(\xi_t) < \gamma k + \int_{S^k} p(D_k = s \mid \xi_t) \min_a \left[ \sum_w \mathcal{L}(a, w) P(w \mid D_k = s, \xi_t) \right] ds. \qquad (3)$$

1 i.e. a classifier for a classification task

2 Though this is not really an i.i.d. sample from the original distribution
except when $|D| - t$ is large.

We can use the same formalism in our setting. In one respect, the problem is 
simpler, as the only decision to be made is when to stop and then we just use 
the currently obtained hypothesis. The difficulty lies in estimating the expected 
error. Unfortunately, the metrics used in active learning methods for selecting 
new examples (see [5] for a review) do not generally include calculations of the 
expected performance gain due to querying additional examples. 

There are two possibilities for estimating this performance gain. The first 
is an algorithm-independent method, described in detail in Sec. 2.1, which uses
a set of convergence curves, arising from theoretical convergence properties. 
We employ a Bayesian framework to infer the probability of each convergence 
curve through observations of the error on the next randomly chosen example 
to be labelled. The second method, outlined in Sec. 4, relies upon a classifier 
with a probabilistic expression of its uncertainty about the class of unlabelled 
examples, but is much more computationally expensive. 



2.1 When no Model is Perfect: Bayesian Model Selection 

The presented Bayesian formalism for optimal sequential decisions follows [10].
We require maintaining a belief $\xi_t$ in the form of a probability
distribution over the set of possible universes $w \in \Omega$. Furthermore, we
require the existence of a well-defined cost for each $w$. Then we can write the
Bayes risk as in the left side of (3), but ignoring the minimisation over $A$ as
there is only one possible decision to be made after stopping,

$$\rho_0(\xi_t) = E(R_t \mid \xi_t) = \sum_{w \in \Omega} E(R_t \mid w) P(w \mid \xi_t), \qquad (4)$$

which can be extended to continuous measures without difficulty. We will write
the expected risk according to our belief at time $t$ for the optimal procedure
taking at most $k + 1$ more samples as

$$\rho_{k+1}(\xi_t) = \min\{\rho_0(\xi_t),\; E[\rho_k(\xi_{t+1}) \mid \xi_t] + \gamma\}. \qquad (5)$$

This implies that at any point in time $t$, we should ignore the cost for the
$t$ samples we have paid for and are only interested in whether we should take
additional samples. The general form of the stopping algorithm is defined in
Alg. 1. Note that the horizon $K$ is a necessary restriction for computability.
A larger value of $K$ leads to potentially better decisions, as when
$K \to \infty$, the bounded horizon optimal decision approaches the optimal
decision in the unbounded horizon setting, as shown for example in Chapter 12 of
[10]. Even with finite $K > 1$, however, the computational complexity is
considerable, since we will have to additionally keep track of how our future
beliefs $P(w \mid \xi_{t+k})$ will evolve for all $k \le K$.
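The recursion in (5) can be sketched for a finite set of models as follows. This is only an illustration under stated assumptions: each model is a hypothetical function mapping $t$ to a predicted error, the only observation per step is a single Bernoulli error indicator, and the names `bayes_risk`, `rho` and `should_stop` are ours. The recursion branches on both outcomes at every step, so its cost grows as $2^K$, which illustrates why the horizon must be kept small.

```python
from typing import Callable, List, Sequence

Curve = Callable[[int], float]  # hypothetical model: t -> predicted error


def bayes_risk(belief: Sequence[float], curves: Sequence[Curve], t: int) -> float:
    """Risk of stopping now: expected error at time t under the belief."""
    return sum(p * c(t) for p, c in zip(belief, curves))


def rho(belief: Sequence[float], curves: Sequence[Curve],
        t: int, k: int, gamma: float) -> float:
    """Expected risk of the best procedure taking at most k more samples."""
    now = bayes_risk(belief, curves, t)
    if k == 0:
        return now
    cont = gamma  # pay for one more label, then average over its outcome
    for z in (0, 1):  # z = 1 means the next hypothesis errs on the sample
        like = [c(t + 1) if z == 1 else 1.0 - c(t + 1) for c in curves]
        pz = sum(p * l for p, l in zip(belief, like))
        if pz > 0.0:
            post: List[float] = [p * l / pz for p, l in zip(belief, like)]
            cont += pz * rho(post, curves, t + 1, k - 1, gamma)
    return min(now, cont)


def should_stop(belief: Sequence[float], curves: Sequence[Curve],
                t: int, horizon: int, gamma: float) -> bool:
    """Stop when looking ahead up to `horizon` steps does not lower the risk."""
    return not rho(belief, curves, t, horizon, gamma) < bayes_risk(belief, curves, t)


# Two hypothetical models: one keeps improving, one is stuck at 0.5.
curves = [lambda t: 1.0 / (t + 2.0), lambda t: 0.5]
belief = [0.5, 0.5]
print(should_stop(belief, curves, t=0, horizon=3, gamma=0.001))
print(should_stop(belief, curves, t=0, horizon=3, gamma=1.0))
```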
Algorithm 1 A general bounded stopping algorithm using Bayesian inference.
Given a dataset $D$ and any learning algorithm $F$, an initial belief
$P(w \mid \xi_0)$ and a method for updating it, and additionally a known query
cost $\gamma$ and a horizon $K$,

1: for $t = 1, 2, \ldots$ do
2:   Use $F$ to query a new example $i \in D$ and obtain $f(t)$.
3:   Observe the empirical error estimate $v_t$ for $f(t)$.
4:   Calculate $P(w \mid \xi_t) = P(w \mid v_t, \xi_{t-1})$.
5:   if $\nexists k \in [1, K] : \rho_k(\xi_t) < \rho_0(\xi_t)$ then
6:     Exit.
7:   end if
8: end for



2.2 The OBSV Algorithm 

In this paper we consider a specific one-step bounded stopping algorithm that
uses independent validation examples for observing the empirical error estimate
$v_t$, which we dub OBSV and is shown in detail in Alg. 2. The algorithm
considers hypotheses $w \in \Omega$ which model how the generalisation error
$r_t$ of the learning algorithm changes with time. We assume that the initial
error is $r_0$ and that the algorithm always converges to some unknown
$r_\infty = \lim_{t \to \infty} r_t$. Furthermore, we need some observations
$v_t$ that will allow us to update our beliefs over $\Omega$. The remainder of
this section discusses the algorithm in more detail.

2.2.1 Steps 1-5, 11-12. Initialisation and Observations 

We begin by splitting the training set $D$ in two parts: $D_A$, which will be
sampled without replacement by the active learning algorithm (if there is one),
and $D_R$, which will be uniformly sampled without replacement. This condition
is necessary in order to obtain i.i.d. samples for the inference procedure
outlined in the next section. However, if we only sample randomly and are not
using an active learning algorithm, then we do not need to split the data and we
can set $D_A = \emptyset$.

At each timestep $t$, we will use a sample from $D_R$ to update $p(w)$. If we
then expect to reduce our future error sufficiently, we will query an example
from $D_A$ using $F$ and subsequently update the classifier $f$ with both
examples. Thus, not only are the observations used for inference independent and
identically distributed, but we are also able to use them to update the
classifier $f$.

2.2.2 Step 6. Updating the Belief 

We model the learning algorithm as a process which asymptotically converges
from an initial error $r_0$ to the unknown final error $r_\infty$. Each model
$w$ will be a convergence estimate, a model of how the error converges from the
initial to the final error rate. More precisely, each $w$ corresponds to a
function $h_w : \mathbb{N} \to [0, 1]$ that models how close we are to
convergence at time $t$. The predicted error at time $t$ according to $w$, given
the initial error $r_0$ and the final error $r_\infty$, will be

$$g_w(t \mid r_0, r_\infty) = r_0 h_w(t) + r_\infty [1 - h_w(t)]. \qquad (6)$$

We find it reasonable to assume that
$p(w, r_0, r_\infty) = p(w)\, p(r_0)\, p(r_\infty)$, i.e. that the convergence
rates do not depend upon the initial and final errors.

We may now use these predictions together with some observations to update
$p(w, r_\infty \mid \xi)$. More specifically, if
$P[r_t = g_w(t \mid r_0, r_\infty) \mid r_0, r_\infty, w] = 1$ and we take $m_t$
independent observations $z_t = (z_t(1), z_t(2), \ldots, z_t(m_t))$ of the error
with mean $v_t$, the likelihood will be given by the Bernoulli density

$$p(z_t \mid w, r_0, r_\infty) = \left( g_w(t \mid r_0, r_\infty)^{v_t} \left[1 - g_w(t \mid r_0, r_\infty)\right]^{1 - v_t} \right)^{m_t}. \qquad (7)$$

Then it is simple to obtain a posterior density for both $w$ and $r_\infty$,

$$p(w \mid z_t) = \frac{p(w)}{p(z_t)} \int_0^1 p(z_t \mid w, r_0, r_\infty = u)\, p(r_\infty = u \mid w)\, du \qquad (8a)$$

$$p(r_\infty \mid z_t) = \frac{p(r_\infty)}{p(z_t)} \int_\Omega p(z_t \mid w, r_0, r_\infty)\, p(w \mid r_\infty)\, dw. \qquad (8b)$$

Starting with a prior distribution $p(w \mid \xi_0)$ and
$p(r_\infty \mid \xi_0)$, we may sequentially update our belief using (8) as
follows:

$$p(w \mid \xi_{t+1}) = p(w \mid z_t, \xi_t) \qquad (9a)$$

$$p(r_\infty \mid \xi_{t+1}) = p(r_\infty \mid z_t, \xi_t). \qquad (9b)$$
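The belief update can be sketched on a finite grid. The snippet below is an illustration under our own assumptions, not the implementation used in the experiments: the belief is stored as a list of (curve, final error, probability) triples over the joint grid of models and final errors, the likelihood is the Bernoulli density (7) evaluated in log space, and the names `predicted_error`, `log_likelihood` and `update_belief` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Curve = Callable[[int], float]
Belief = List[Tuple[Curve, float, float]]  # (h_w, r_inf, probability)


def predicted_error(h: Curve, t: int, r0: float, r_inf: float) -> float:
    """Predicted error (6): interpolate between initial and final error."""
    return r0 * h(t) + r_inf * (1.0 - h(t))


def log_likelihood(g: float, v: float, m: int) -> float:
    """Log of (7) for m observations with mean error v and predicted error g."""
    g = min(max(g, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return m * (v * math.log(g) + (1.0 - v) * math.log(1.0 - g))


def update_belief(belief: Belief, t: int, v: float, m: int, r0: float) -> Belief:
    """Posterior over the (model, final error) grid after observing v at time t."""
    logw = [math.log(p) + log_likelihood(predicted_error(h, t, r0, r_inf), v, m)
            for h, r_inf, p in belief if p > 0.0]
    kept = [(h, r_inf) for h, r_inf, p in belief if p > 0.0]
    top = max(logw)
    weights = [math.exp(l - top) for l in logw]  # shift for numerical stability
    norm = sum(weights)
    return [(h, r_inf, wt / norm) for (h, r_inf), wt in zip(kept, weights)]


# Two hypothetical, already converged models with final errors 0.1 and 0.4.
converged: Curve = lambda t: 0.0
prior: Belief = [(converged, 0.1, 0.5), (converged, 0.4, 0.5)]
posterior = update_belief(prior, t=5, v=0.0, m=1, r0=0.9)
print([round(p, 3) for _, _, p in posterior])
```

The marginals in (8) are obtained from this joint posterior by summing over the other coordinate of the grid.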

The realised convergence for a particular training data set may differ
substantially from the expected convergence: the average convergence curve will
be smooth, while any specific instantiation of it will not be. More formally,
the realised error given a specific training dataset is
$q_t = E[R_t \mid D^t]$, where $D^t \sim \mathcal{D}^t$, while the expected
error given the data distribution is
$r_t = E[R_t] = \int_{S^t} E[R_t \mid D^t]\, P(D^t)\, dD^t$. The smooth
convergence curves that we model would then correspond to models for $r_t$.

Fortunately, in our case we can estimate a distribution over $r_t$ without having
to also estimate a distribution for $q_t$, as this is integrated out for
observations $z \in \{0, 1\}$:

$$p(z \mid q_t) = q_t^z (1 - q_t)^{1 - z} \qquad (10a)$$

$$p(z \mid r_t) = \int_0^1 p(z \mid q_t)\, p(q_t = u \mid r_t)\, du = r_t^z (1 - r_t)^{1 - z}. \qquad (10b)$$



2.2.3 Step 8. Deciding whether to Stop

We may now use the distribution over the models to predict the error should
we choose to add $k$ more examples. This is simply

$$E[R_{t+k} \mid \xi_t] = \int_0^1 \int_\Omega g_w(t + k \mid r_0, r_\infty)\, p(w \mid \xi_t)\, p(r_\infty \mid \xi_t)\, dw\, dr_\infty.$$

The calculation required for step 8 of OBSV follows trivially.
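On a finite grid the prediction becomes a weighted sum, and the stopping test compares the predicted reduction in error with the cost of the additional labels, in the spirit of (3). The following is a sketch under our own assumptions: the belief is a list of (curve, final error, probability) triples, and the names `expected_error` and `should_stop` are hypothetical.

```python
from typing import Callable, List, Tuple

Curve = Callable[[int], float]
Belief = List[Tuple[Curve, float, float]]  # (h_w, r_inf, probability)


def expected_error(belief: Belief, t: int, r0: float) -> float:
    """Predicted error at time t, averaged over the grid of models."""
    return sum(p * (r0 * h(t) + r_inf * (1.0 - h(t))) for h, r_inf, p in belief)


def should_stop(belief: Belief, t: int, k: int, gamma: float, r0: float) -> bool:
    """Stop when k more labels are predicted to cost more than they gain."""
    now = expected_error(belief, t, r0)
    later = expected_error(belief, t + k, r0)
    return now < later + k * gamma


# One hypothetical model: error decays from 0.9 towards 0.1 as 1 / (t + 1).
belief: Belief = [(lambda t: 1.0 / (t + 1.0), 0.1, 1.0)]
print(expected_error(belief, 9, r0=0.9))          # about 0.18
print(should_stop(belief, 9, 1, gamma=0.001, r0=0.9))
print(should_stop(belief, 9, 1, gamma=0.01, r0=0.9))
```

Here the predicted gain from one more label at $t = 9$ is roughly 0.007, so querying continues for $\gamma = 0.001$ and stops for $\gamma = 0.01$.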
Algorithm 2 OBSV, a specific instantiation of the bounded stopping algorithm.

Given a dataset $D$ with examples in $N_c$ classes and any learning algorithm
$F$, initial beliefs $P(w \mid \xi_0)$ and $P(r_\infty \mid \xi_0)$ and a method
for updating them, and additionally a known query cost $\gamma$ for discovering
the class label $y_i \in [1, \ldots, n]$ of example $i \in D$,

1: Split $D$ into $D_A$, $D_R$.
2: $r_0 = 1 - 1/N_c$.
3: Initialise the classifier $f$.
4: for $t = 1, 2, \ldots$ do
5:   Sample $i \in D_R$ without replacement and observe $f(x_i)$, $y_i$ to calculate $v_t$.
6:   Calculate $P(w, r_\infty \mid \xi_t) = P(w, r_\infty \mid v_t, \xi_{t-1})$.
7:   If $D_A \neq \emptyset$, set $k = 2$, otherwise $k = 1$.
8:   if $E[R_{t+k} \mid \xi_t] + k\gamma > E[R_t \mid \xi_t]$ then
9:     Exit.
10:  end if
11:  If $D_A \neq \emptyset$, use $F$ to query a new example $j \in D_A$ without replacement, $D_T \leftarrow D_T \cup j$.
12:  $D_T \leftarrow D_T \cup i$, $f \leftarrow F(D_T)$.
13: end for



2.2.4 Specifics of the Model 

What remains unspecified is the set of convergence curves that will be employed. 
We shall make use of curves related to common theoretical convergence results. 
It is worthwhile to keep in mind that we simply aim to find the combination of 
the available estimates that gives the best predictions. While none of the esti- 
mates might be particularly accurate, we expect to obtain reasonable stopping 
times when they are optimally combined in the manner described in the previ- 
ous section. Ultimately, we expect to end up with a fairly narrow distribution 
over the possible convergence curves. 

One of the weakest convergence results [11] is for sample complexity of order
$O(1/\epsilon_t^2)$, which corresponds to the convergence curve

$$h_q(t) = \left( \frac{\kappa}{t + \kappa} \right)^{1/2}, \quad \kappa \ge 1. \qquad (11)$$

Another common type is for sample complexity of order $O(1/\epsilon_t)$, which
corresponds to the curve

$$h_g(t) = \frac{\lambda}{t + \lambda}, \quad \lambda \ge 1. \qquad (12)$$

A final possibility is that the error decreases exponentially fast. This is
theoretically possible in some cases, as was proven in [9]. The resulting sample
complexity of order $O(\log(1/\epsilon_t))$ corresponds to the convergence curve

$$h_{\exp}(t) = \beta^t, \quad \beta \in (0, 1). \qquad (13)$$

Since we do not know what appropriate values of the constants $\beta$, $\lambda$
and $\kappa$ are, we will model this uncertainty as an additional distribution
over them, i.e. $p(\beta \mid \xi_t)$. This would be updated together with the
rest of our belief distribution and could be done in some cases analytically. In
this paper, however, we consider approximating the continuous densities by a
sufficiently large set of models, one for each possible value of the unknown
constants.
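A finite model set of this kind can be sketched as follows. The curve forms are assumptions of this sketch: a square-root curve $(\kappa/(t+\kappa))^{1/2}$, a hyperbolic curve $\lambda/(t+\lambda)$ and an exponential curve $\beta^t$. The grids of constants and the uniform prior are hypothetical choices for illustration, not the ones used in the experiments.

```python
from functools import partial
from typing import Callable, List, Sequence


def h_q(t: int, kappa: float) -> float:
    """Assumed square-root curve, for sample complexity O(1/eps^2)."""
    return (kappa / (t + kappa)) ** 0.5


def h_g(t: int, lam: float) -> float:
    """Assumed hyperbolic curve, for sample complexity O(1/eps)."""
    return lam / (t + lam)


def h_exp(t: int, beta: float) -> float:
    """Assumed exponential curve, for sample complexity O(log(1/eps))."""
    return beta ** t


def make_models(kappas: Sequence[float], lambdas: Sequence[float],
                betas: Sequence[float]) -> List[Callable[[int], float]]:
    """One model per value of each unknown constant."""
    models: List[Callable[[int], float]] = [partial(h_q, kappa=k) for k in kappas]
    models += [partial(h_g, lam=l) for l in lambdas]
    models += [partial(h_exp, beta=b) for b in betas]
    return models


# Hypothetical grids of constants and a uniform prior over the models.
models = make_models(kappas=[1, 10, 100], lambdas=[1, 10, 100],
                     betas=[0.5, 0.9, 0.99])
prior = [1.0 / len(models)] * len(models)
print(len(models), [round(m(10), 3) for m in models])
```

Every curve starts at 1 for $t = 0$ and decreases towards 0, so the predicted error moves from $r_0$ to $r_\infty$.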
Figure 1: Illustration of the estimated error on a 10-class problem with a cost
per label of $\gamma = 0.001$. On the vertical axis, $r_t$ is the history of the
predicted generalisation error, i.e. $E[r_t \mid \xi_{t-1}]$, while $R_t$ is the
generalisation error measured on a test set of size 10,000 and $C_t$ is the
corresponding actual cost. Finally, $R_w$ and $E[C_t]$ are the final estimated
convergence and cost curves given all the observations. The stopping time is
indicated by $S$, which equals 0.5 whenever Alg. 2 decides to stop, and $t$ is
the number of iterations.

As a simple illustration, we examined the performance of the estimation and
the stopping criterion in a simple classification problem with data of 10
classes, each with an equivariant Gaussian distribution in an 8-dimensional
space. Each unknown point was simply assigned the label of the class whose
empirical mean, computed from the observations of that class, was closest.
Examples were always chosen randomly.

As can be seen in Fig. 1, at the initial stages the estimates are inaccurate. 
This is because of two reasons: (a) The distribution over convergence rates is 
initially dominated by the prior. As more data is accumulated, there is better 
evidence for what the final error will be. (b) As we mentioned in the discussion 
of step 6, the realised convergence curve is much more random than the
expected convergence curve which is actually modelled. However, as the number
of examples approaches infinity, the expected and realised errors converge. The
stopping time for Alg. 2 (indicated by $S$) is nevertheless relatively close to
the optimal stopping time, as $C_t$ appears to be minimised near 200. The following
section presents a more extensive evaluation of this stopping algorithm. 

3 Experimental Evaluation 

The main purpose of this section is to evaluate the performance of the OBSV 
stopping algorithm. This is done by examining its cost and stopping time when 
compared to the optimal stopping time. Another aim of the experimental evalu- 
ation was to see whether mixed sampling strategies have an advantage compared 
to random sampling strategies with respect to the cost, when the stopping time 
is decided using a stopping algorithm that takes into account the labelling cost. 
Following [7], we plot performance curves for a range of values of $\gamma$,
utilising multiple runs of cross-validation in order to assess the sensitivity
of the results to the data. For each run, we split the data into a training set
$D$ and test set $D_E$, the training set itself being split into random and
mixed sampling sets whenever appropriate.

More specifically, we compare the OBSV algorithm with the oracle stopping 
time. The latter is defined simply as the stopping time minimising the cost as 
this is measured on the independent test set for that particular run. We also 
compare random sampling with mixed sampling. In random sampling, we 
simply query unlabelled examples without replacement. For the mixed sampling
procedure, we actively query an additional label for the example from $D_A$
closest to the decision boundary of the current classifier, also without
replacement. This strategy relies on the assumption that those labels are most
informative [4, 5, 6] and thus convergence will be faster. Stopping times and
cost ratio curves are shown for a set of $\gamma$ values, for costs as defined
in (2). These values of $\gamma$ are also used as input to the stopping
algorithm. The ratios are used both to compare stopping algorithms (OBSV versus
the oracle) and sampling strategies (random sampling, where $D_A = \emptyset$,
and mixed sampling, with $|D_A| = |D_R|$).
Average test error curves are also plotted for reference. 

For the experiments we used two data sets from the UCI repository$^3$: the
Wisconsin breast cancer data set (wdbc) with 569 examples and the spambase
database (spam) with 4601 examples. We evaluated wdbc and spam using 5 and
3 randomised runs of 3-fold stratified cross-validation respectively, which
gives a total of 15 runs for wdbc and 9 for spam. The classifier used was
AdaBoost [12] with 100 decision stumps as base hypotheses. We ran experiments
for values of
$\gamma \in \{9 \cdot 10^{-k}, 8 \cdot 10^{-k}, \ldots, 1 \cdot 10^{-k}\}$, with
$k = 1, \ldots, 7$, and $\gamma = 0$. For every algorithm and each value of
$\gamma$ we obtain a different stopping time $t_\gamma$ for each run. We then
calculate $v_e(\gamma, F, t_\gamma)$ as given in (2) on the corresponding test
set of the run. By examining the averages and extreme values over all runs we
are able to estimate the sensitivity of the results to the data.

3 http://mlearn.ics.uci.edu/MLRepository.html
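For concreteness, the grid of label costs and the number of runs described above can be enumerated as in the following sketch (the variable names are ours).

```python
# Label costs d * 10^-k for d = 1..9 and k = 1..7, plus the free-label case.
gammas = [0.0] + sorted(d * 10.0 ** -k for k in range(1, 8) for d in range(1, 10))

# Randomised repetitions times folds of stratified cross-validation.
runs = {"wdbc": 5 * 3, "spam": 3 * 3}

print(len(gammas), gammas[1], gammas[-1])
print(runs)
```

This gives 64 values of the cost in total, ranging from 0 to 0.9.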
The results comparing the oracle with OBSV for the random sampling
strategy$^4$ are shown in Fig. 2. In Fig. 2(a), 2(b) it can be seen that the
stopping times of OBSV and the oracle increase at a similar rate. However,
although OBSV is reasonably close, on average it regularly stops earlier. This
may be due to a number of reasons. For example, due to the prior, OBSV stops
immediately when $\gamma > 3 \cdot 10^{-2}$. At the other extreme, when
$\gamma \to 0$ the cost becomes the test error and therefore the oracle always
stops at the latest at the minimum test error$^5$. This is due to the stochastic
nature of the realised error curve, which cannot be modelled; there, the perfect
information that the oracle enjoys accounts for most of the performance
difference. As shown in Fig. 2(c), 2(d), the extra cost induced by using OBSV
instead of the oracle is bounded from above for most of the runs by factors of
2 to 5 for wdbc and around 0.5 for spam. The rather higher difference on wdbc is
partially a result of the small dataset. Since we can only measure an error in
quanta of $1/|D_E|$, any actual performance gain lower than this will be
unobservable. This explains why the number of examples queried by the oracle
becomes constant for a value of $\gamma$ smaller than this threshold. Finally,
this fact also partially explains the greater variation of the oracle's stopping
time in the smaller dataset. We expect that with larger test sets, the oracle's
behaviour would be smoother.

The corresponding comparison for the mixed sampling strategies is shown 
in Fig. 4(a), 4(b). We again observe the stopping times to increase at a similar 
rate, and OBSV to stop earlier on average than the oracle for most values of
$\gamma$ (Fig. 3(a), 3(b)). Note that the oracle selects the minimum test error
at around 180 labels from wdbc and 1300 labels from spam, which for both data
sets is only about half of the number of labels the random strategy needs. OBSV
tracks these stopping times closely. Overall, the fact that in both mixed and
random sampling the stopping times of OBSV and the oracle are usually well
within the extreme value ranges indicates a satisfactory performance.

Finally we compare the two sampling strategies directly as shown in Fig. 4, 
using the practical OBSV algorithm. As one might expect from the fact that 
the mixed strategy converges faster to a low error level, OBSV stops earlier or 
around the same time using the mixed strategy than it does for the random 
(Fig. 4(c), 4(d)). Those two facts together indicate that OBSV works as in- 
tended, since it stops earlier when convergence is faster. The results also show 
that when using OBSV as a stopping criterion mixed sampling is equal to or bet- 
ter than random sampling [Fig. 4(c), 4(f)]. However the differences are mostly 
not very significant. 

4 Discussion 

This paper discussed the interplay between a well-defined cost function, stopping 
algorithms and objective evaluation criteria and their relation to active learning. 
Specifically, we have argued that (a) learning when labels are costly is
essentially a stopping problem, (b) it is possible to use optimal stopping
procedures based on a suitable cost function, (c) the goal of active learning
algorithms could also be represented by this cost function, (d) metrics on this
cost function should be used to evaluate performance and, finally, (e) the
stopping problem cannot be considered separately from either the cost function
or the evaluation. To our current knowledge, these issues have not yet been
sufficiently addressed.

4 The corresponding average test errors can be seen in Fig. 4(a), 4(b).
5 This is obtained after about 260 labels on wdbc and 2400 labels on spam.

For this reason, we have proposed a suitable cost function and presented 
a practical stopping algorithm which aims to be optimal with respect to this 
cost. Experiments with this algorithm for a specific prior show that it suffers 
only small loss compared to the optimal stopping time and is certainly a step 
forward from ad-hoc stopping rules. 

On the other hand, while the presented stopping algorithm is an adequate 
first step, its combination with active learning is not perfectly straightforward 
since the balance between active and uniform sampling is a hypcrparameter 
which is not obvious how to set. 6 An alternative is to use model-specific stopping 
methods. This could be done if we restrict ourselves to probabilistic classifiers, 
as for example in [1]; in this way we may be able to simultaneously perform 
optimal example selection and stopping. If such a classifier is not available 
for the problem at hand, then judicious use of frcquentist techniques such as 
bootstrapping [13] may provide a sufficiently good alternative for estimating 
probabilities. Such an approach was advocated by [14] in order to optimally 
select examples; however in our case we could extend this to optimal stopping. 
Briefly, this can be done as follows. Let our belief at time $t$ be $\xi_t$, such that 
for any point $x \in \mathcal{X}$, we have a distribution over $\mathcal{Y}$, $P(y \mid x, \xi_t)$. We may now 
calculate this over the whole dataset to estimate the realised generalisation error 
as the expected error given the empirical data distribution and our classifier 

$$E_D(v_t \mid \xi_t) = \frac{1}{|D|} \sum_{i \in D} \left[ 1 - \max_{y} P(y_i = y \mid x_i, \xi_t) \right]. \tag{14}$$

We can now calculate (14) for each one of the different possible labels. So we 
calculate the expected error on the empirical data distribution if we create a new 
classifier from $\xi_t$ by adding example $i$ as 

$$E_D(v_t \mid x_i, \xi_t) = \sum_{y \in \mathcal{Y}} P(y_i = y \mid x_i, \xi_t) \, E_D(v_t \mid x_i, y_i = y, \xi_t). \tag{15}$$

Note that $P(y_i = y \mid x_i, \xi_t)$ is just the probability of example $i$ having label $y$ 
according to our current belief, $\xi_t$. Furthermore, $E_D(v_t \mid x_i, y_i = y, \xi_t)$ results 
from calculating (14) using the classifier resulting from $\xi_t$ and the added example 
$i$ with label $y$. Then $E_D(v_t \mid \xi_t) - E_D(v_t \mid x_i, \xi_t)$ will be the expected gain from 
using $i$ to train. The (subjectively) optimal 1-step stopping algorithm is as 
follows: Let $i^* = \arg\min_i E_D(v_t \mid x_i, \xi_t)$. Stop if 
$E_D(v_t \mid \xi_t) - E_D(v_t \mid x_{i^*}, \xi_t) < \gamma$. 
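
The one-step rule above can be illustrated with a minimal sketch. It assumes a toy kernel-vote classifier on scalar inputs, in which the belief at time t is represented simply by the labelled set, and it reads the bracket in (14) as one minus the largest predicted class probability. The classifier, the function names and the parameters `bandwidth` and `smoothing` are illustrative assumptions and are not taken from the paper.

```python
import math


def predict_proba(labelled, x, labels, bandwidth=1.0, smoothing=1.0):
    """P(y | x, belief) from a kernel-weighted vote over the labelled set.

    `labelled` is a list of (x_j, y_j) pairs and stands in for the belief.
    The classifier itself is a hypothetical stand-in for any probabilistic model.
    """
    weights = {y: smoothing for y in labels}
    for xj, yj in labelled:
        weights[yj] += math.exp(-((x - xj) ** 2) / (2 * bandwidth ** 2))
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def expected_error(labelled, data, labels):
    """Equation (14): average over the data of 1 - max_y P(y | x, belief)."""
    errors = (1.0 - max(predict_proba(labelled, x, labels).values()) for x in data)
    return sum(errors) / len(data)


def expected_error_after_query(labelled, data, labels, x_i):
    """Equation (15): average (14) over the possible labels of candidate x_i."""
    p = predict_proba(labelled, x_i, labels)
    return sum(
        p[y] * expected_error(labelled + [(x_i, y)], data, labels) for y in labels
    )


def one_step_stop(labelled, data, candidates, labels, gamma):
    """Return (stop, best_candidate, expected_gain) for label cost gamma."""
    current = expected_error(labelled, data, labels)
    best = min(
        candidates,
        key=lambda x: expected_error_after_query(labelled, data, labels, x),
    )
    gain = current - expected_error_after_query(labelled, data, labels, best)
    return gain < gamma, best, gain


if __name__ == "__main__":
    labelled = [(0.0, 0), (4.0, 1)]
    data = [0.0, 1.0, 2.0, 3.0, 4.0]
    stop, best, gain = one_step_stop(labelled, data, [1.0, 2.0, 3.0], [0, 1], gamma=0.01)
    print(stop, best, gain)
```

Each candidate requires one retraining per possible label, so with a classifier that is expensive to retrain this sketch would only be practical for a small candidate set.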

⁶ In this paper, the active and uniform sampling rates were equal. 

A particular difficulty in the presented framework, and to some extent also 
in the field of active learning, is the choice of hyperparameters for the classifiers 
themselves. For Bayesian models it is possible to select those that maximise 
the marginal likelihood.⁷ One could alternatively maintain a set of models with 
different hyperparameter choices and separate convergence estimates. In that 
case, training would stop when there were no models for which the expected gain 
was larger than the cost of acquiring another label. Even this strategy, however, 
is problematic in the active learning framework, where each model may choose 
to query a different example's label. Thus, the question of hyperparameter 
selection remains open and we hope to address it in future work. 
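
The stopping condition for a set of models with separate convergence estimates can be sketched as follows. This is a hypothetical helper: the name `should_stop`, the dictionary of per-model gain estimates and the example values are assumptions made for illustration. How each model's expected gain is estimated, and which example to query when the models disagree, are left open here, as in the text.

```python
def should_stop(expected_gains, gamma):
    """Stop when no model's expected gain exceeds the labelling cost gamma.

    `expected_gains` maps a (hypothetical) model identifier to that model's
    own estimate of the gain from acquiring one more label.
    """
    return all(gain <= gamma for gain in expected_gains.values())


if __name__ == "__main__":
    # Illustrative values only: one model still expects a gain above gamma,
    # so label acquisition would continue.
    gains = {"stumps_50": 0.002, "stumps_100": 0.03}
    print(should_stop(gains, gamma=0.01))
```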

On another note, we hope that the presented exposition will at the very 
least increase awareness of optimal stopping and evaluation issues in the active 
learning community, lead to commonly agreed standards for the evaluation of 
active learning algorithms, or even encourage the development of example se- 
lection methods incorporating the notions of optimality suggested in this paper. 
Perhaps the most interesting result for active learning practitioners is the very 
narrow advantage of mixed sampling when a realistic algorithm is used for the 
stopping times. While this might only have been an artifact of the particular 
combinations of stopping and sampling algorithm and the datasets used, we 
believe that it is a matter which should be given some further attention. 

Figure 2: Results for random sampling on the wdbc (left column) and the 
spam data (right column) as obtained from the 15 (wdbc) and 9 (spam) runs of 
AdaBoost with 100 decision stumps. The first row (a), (b), plots the average 
stopping times from OBSV and the oracle as a function of the labelling cost $\gamma$. 
For each $\gamma$ the extreme values from all runs are denoted by the dashed lines. 
The second row, (c), (d), shows the corresponding average ratio in $v_e$ over all 
runs between OBSV and the oracle, where for each $\gamma$ the 3rd (wdbc) / 2nd (spam) 
extreme values from all runs are denoted by the dashed lines. Note a zero value 
on a logarithmic scale is denoted by a cross or by a triangle. Note for wdbc 
and smaller values of $\gamma$ the average ratio in $v_e$ sometimes exceeds the denoted 
extreme values because a zero test error occurred in one run. 

Figure 3: Results for mixed sampling on the wdbc (left column) and the 
spam data (right column) as obtained from the 15 (wdbc) and 9 (spam) runs of 
AdaBoost with 100 decision stumps. The first row (a), (b), plots the average 
stopping times from OBSV and the oracle as a function of the labelling cost $\gamma$. 
For each $\gamma$ the extreme values from all runs are denoted by the dashed lines. 
The second row, (c), (d), shows the corresponding average ratio in $v_e$ over all 
runs between OBSV and the oracle, where for each $\gamma$ the 3rd (wdbc) / 2nd (spam) 
extreme values from all runs are denoted by the dashed lines. Note a zero value 
on a logarithmic scale is denoted by a cross. 

Figure 4: Results comparing random (RAND) and mixed (MIX) sampling on 
the wdbc (left column) and the spam data (right column) as obtained from the 
15 (wdbc) and 9 (spam) runs of AdaBoost with 100 decision stumps. The first 
row (a), (b), shows the test error of each sampling strategy averaged over all 
runs. The second row (a), (b), plots the average stopping times from OBSV 
and the oracle as a function of the labelling cost $\gamma$. For each $\gamma$ the extreme 
values from all runs are denoted by the dashed lines. The third row, (c), (d), 
shows the corresponding average ratio in $v_e$ over all runs between OBSV and 
the oracle, where for each $\gamma$ the 3rd (wdbc) / 2nd (spam) extreme values from all 
runs are denoted by the dashed lines. Note a zero value on a logarithmic scale 
is denoted by a cross. 